Intro

As a data scientist working with clinicians and medical students, I get asked a lot whether to learn R or Python. You, as a clinician, are a busy person, but you are curious and want to contribute to the future of technology in healthcare. You may have heard that Python and R are the de facto standard languages for data science. However, like most people entering data science across the world, you are wondering which language you should use.

Here is a summary of my thoughts on this whole debate. For reference, I’ve been programming in R and Python for about 5 years now, specifically within the realm of data science in health care. Before I get to the actual content, I want to say the following:

  • Both of these languages are fully capable of doing data science
  • Neither language is categorically better than the other

That being said, there are differences between the two languages and those differences may contribute to which you choose. If you are lazy and don’t want to read the whole article:

tl;dr

  • If you are looking for a language to rapidly do data manipulation, visualization, reporting, and prototyping dashboards, pick R.
  • If you are looking to implement anything custom or want to put your analysis in production or need to interface with other people’s code, pick Python
  • If you want to do something with neural networks, pick Python
  • If you want the latest hypothesis tests or statistical analysis, pick R
  • If you want to become a flexible data scientist, you should eventually learn both
  • If you have no programming background, and aren’t that interested in getting really good at programming, pick R
  • If you have a programming background or want to become proficient at programming in general, pick Python

Also, there is an infographic from datacamp that shows some of the differences between R and Python here

However, I don’t agree with some of the points made in the infographic, so take it with a grain of salt.

Structure

What I plan to do over the course of this piece is to highlight the differences between R and Python and give my opinion about which language performs which function better. Here’s a rough outline:

  • Language ecosystem and community
  • Data Cleaning
  • Visualization
  • Statistics and Machine Learning
  • Reporting
  • Web-Applications
  • Development Environments

First, a discussion of ecosystems

It might seem strange to begin with a discussion of add-ons and the community behind the two languages without first talking about the languages themselves. However, the community behind a programming language is actually one of the most important indicators of a language's success and general usability. Add-ons, known as libraries or packages, can turn languages that are frustrating to program in into languages that are a pleasure to write. My thoughts about the ecosystems of the two languages can be summarized quite succinctly:

R’s ecosystem is very user-friendly and is specialized towards data manipulation, visualization, and statistical inference. In addition, R is primarily a data science language first and foremost, meaning most if not all of the packages are designed to fit within a data science framework.

Python’s data science ecosystem is also quite well developed. However, its packages are sometimes not as easy to use as R’s. That being said, the trade-off is that Python is a much more flexible language than R. Because Python’s origins are as a general-purpose language, it offers some functionality that is harder to replicate in R.

As a quick note, because this is something that comes up a lot: if you are looking to implement custom machine learning models, Python is a better environment to do so. Most of the cutting-edge deep learning research is conducted in Python. R has more packages geared towards complicated statistical techniques, such as those that arise in genetics research. This actually reflects the main difference between the users of the two languages. Traditionally, academics and researchers have preferred to use R due to all of the statistical packages that are available. People coming from software engineering backgrounds tend to prefer Python because of its flexibility as a general-purpose programming language.

With that, let’s dive into examples of functionality in both languages so you can see for yourself which you prefer:

Data Manipulation and Munging

Obviously, one of the things that you need to be able to do in either language is to work with the data itself. Let’s start by reading in some data!

First up, R:

Reading in Data

diabetes_data <- read.csv("./dataset_diabetes/diabetic_data.csv")
head(diabetes_data) 
##   encounter_id patient_nbr            race gender     age weight
## 1      2278392     8222157       Caucasian Female  [0-10)      ?
## 2       149190    55629189       Caucasian Female [10-20)      ?
## 3        64410    86047875 AfricanAmerican Female [20-30)      ?
## 4       500364    82442376       Caucasian   Male [30-40)      ?
## 5        16680    42519267       Caucasian   Male [40-50)      ?
## 6        35754    82637451       Caucasian   Male [50-60)      ?
##   admission_type_id discharge_disposition_id admission_source_id
## 1                 6                       25                   1
## 2                 1                        1                   7
## 3                 1                        1                   7
## 4                 1                        1                   7
## 5                 1                        1                   7
## 6                 2                        1                   2
##   time_in_hospital payer_code        medical_specialty num_lab_procedures
## 1                1          ? Pediatrics-Endocrinology                 41
## 2                3          ?                        ?                 59
## 3                2          ?                        ?                 11
## 4                2          ?                        ?                 44
## 5                1          ?                        ?                 51
## 6                3          ?                        ?                 31
##   num_procedures num_medications number_outpatient number_emergency
## 1              0               1                 0                0
## 2              0              18                 0                0
## 3              5              13                 2                0
## 4              1              16                 0                0
## 5              0               8                 0                0
## 6              6              16                 0                0
##   number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1                0 250.83      ?      ?                1          None
## 2                0    276 250.01    255                9          None
## 3                1    648    250    V27                6          None
## 4                0      8 250.43    403                7          None
## 5                0    197    157    250                5          None
## 6                0    414    411    250                9          None
##   A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1      None        No          No          No             No          No
## 2      None        No          No          No             No          No
## 3      None        No          No          No             No          No
## 4      None        No          No          No             No          No
## 5      None        No          No          No             No          No
## 6      None        No          No          No             No          No
##   acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1            No        No        No          No           No            No
## 2            No        No        No          No           No            No
## 3            No    Steady        No          No           No            No
## 4            No        No        No          No           No            No
## 5            No    Steady        No          No           No            No
## 6            No        No        No          No           No            No
##   acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1       No       No           No         No      No          No      No
## 2       No       No           No         No      No          No      Up
## 3       No       No           No         No      No          No      No
## 4       No       No           No         No      No          No      Up
## 5       No       No           No         No      No          No  Steady
## 6       No       No           No         No      No          No  Steady
##   glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1                  No                  No                       No
## 2                  No                  No                       No
## 3                  No                  No                       No
## 4                  No                  No                       No
## 5                  No                  No                       No
## 6                  No                  No                       No
##   metformin.rosiglitazone metformin.pioglitazone change diabetesMed
## 1                      No                     No     No          No
## 2                      No                     No     Ch         Yes
## 3                      No                     No     No         Yes
## 4                      No                     No     Ch         Yes
## 5                      No                     No     Ch         Yes
## 6                      No                     No     No         Yes
##   readmitted
## 1         NO
## 2        >30
## 3         NO
## 4         NO
## 5         NO
## 6        >30

We’ll use this dataset to illustrate some of the differences between using the base language and using packages to help us. Let’s say we want to find which combinations of race and gender have more medications on average and sort them according to that number. In the base R language, we could do that like this:

Simple Aggregation

# Base language
result <- aggregate(diabetes_data$num_medications, by=list(diabetes_data$race, diabetes_data$gender), FUN=mean)

result <- result[order(result$x, decreasing = TRUE),]

head(result)
##            Group.1         Group.2        x
## 14           Other Unknown/Invalid 22.00000
## 4        Caucasian          Female 16.46434
## 10       Caucasian            Male 16.09105
## 7                ?            Male 15.88576
## 1                ?          Female 15.74492
## 2  AfricanAmerican          Female 15.63446

Some of the syntax can seem foreign at first glance. Luckily, the dplyr package makes this operation trivial and, more importantly, readable:

library(dplyr) # Import the dplyr library
result <- diabetes_data %>% 
            group_by(race, gender) %>% 
            summarize(average_medications = mean(num_medications)) %>% 
            arrange(desc(average_medications))

head(result)
## # A tibble: 6 x 3
## # Groups:   race [4]
##   race            gender          average_medications
##   <fct>           <fct>                         <dbl>
## 1 Other           Unknown/Invalid                22  
## 2 Caucasian       Female                         16.5
## 3 Caucasian       Male                           16.1
## 4 ?               Male                           15.9
## 5 ?               Female                         15.7
## 6 AfricanAmerican Female                         15.6

In general, many of the packages that are available in R are user-friendly and powerful. The top packages are extremely well-designed, in large part due to Hadley Wickham, a famous contributor to the R ecosystem who has helped cement R's place as a core data science language.

Filtering

filtering_example <- diabetes_data %>% 
                        filter(gender == "Female") %>%
                        filter(race == "Caucasian") %>%
                        filter(num_medications > 8)

head(filtering_example)
##   encounter_id patient_nbr      race gender      age weight
## 1       149190    55629189 Caucasian Female  [10-20)      ?
## 2        12522    48330783 Caucasian Female  [80-90)      ?
## 3        15738    63555939 Caucasian Female [90-100)      ?
## 4        40926    85504905 Caucasian Female  [40-50)      ?
## 5        84222   108662661 Caucasian Female  [50-60)      ?
## 6       183930   107400762 Caucasian Female  [80-90)      ?
##   admission_type_id discharge_disposition_id admission_source_id
## 1                 1                        1                   7
## 2                 2                        1                   4
## 3                 3                        3                   4
## 4                 1                        3                   7
## 5                 1                        1                   7
## 6                 2                        6                   1
##   time_in_hospital payer_code      medical_specialty num_lab_procedures
## 1                3          ?                      ?                 59
## 2               13          ?                      ?                 68
## 3               12          ?       InternalMedicine                 33
## 4                7          ? Family/GeneralPractice                 60
## 5                3          ?             Cardiology                 29
## 6               11          ?                      ?                 42
##   num_procedures num_medications number_outpatient number_emergency
## 1              0              18                 0                0
## 2              2              28                 0                0
## 3              3              18                 0                0
## 4              0              15                 0                1
## 5              0              11                 0                0
## 6              2              19                 0                0
##   number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1                0    276 250.01    255                9          None
## 2                0    398    427     38                8          None
## 3                0    434    198    486                8          None
## 4                0    428 250.43  250.6                8          None
## 5                0    682    174    250                3          None
## 6                0    V57    715    V43                8          None
##   A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1      None        No          No          No             No          No
## 2      None        No          No          No             No          No
## 3      None        No          No          No             No          No
## 4      None    Steady          Up          No             No          No
## 5      None        No          No          No             No          No
## 6      None        No          No          No             No          No
##   acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1            No        No        No          No           No            No
## 2            No    Steady        No          No           No            No
## 3            No        No        No          No           No        Steady
## 4            No        No        No          No           No            No
## 5            No        No    Steady          No           No            No
## 6            No        No        No          No           No            No
##   acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1       No       No           No         No      No          No      Up
## 2       No       No           No         No      No          No  Steady
## 3       No       No           No         No      No          No  Steady
## 4       No       No           No         No      No          No    Down
## 5       No       No           No         No      No          No      No
## 6       No       No           No         No      No          No      No
##   glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1                  No                  No                       No
## 2                  No                  No                       No
## 3                  No                  No                       No
## 4                  No                  No                       No
## 5                  No                  No                       No
## 6                  No                  No                       No
##   metformin.rosiglitazone metformin.pioglitazone change diabetesMed
## 1                      No                     No     Ch         Yes
## 2                      No                     No     Ch         Yes
## 3                      No                     No     Ch         Yes
## 4                      No                     No     Ch         Yes
## 5                      No                     No     No         Yes
## 6                      No                     No     No          No
##   readmitted
## 1        >30
## 2         NO
## 3         NO
## 4        <30
## 5         NO
## 6        >30

What about python?

In Python, the de facto standard package for working with data is pandas. (It is a third-party package that you install separately, not part of Python's built-in standard library.)

Reading in Data

import pandas as pd
diabetes_data = pd.read_csv("./dataset_diabetes/diabetic_data.csv")
print(diabetes_data.head())
##    encounter_id  patient_nbr    ...     diabetesMed readmitted
## 0       2278392      8222157    ...              No         NO
## 1        149190     55629189    ...             Yes        >30
## 2         64410     86047875    ...             Yes         NO
## 3        500364     82442376    ...             Yes         NO
## 4         16680     42519267    ...             Yes         NO
## 
## [5 rows x 50 columns]

Simple Aggregation

diabetes_python = (diabetes_data
            .groupby(['race', 'gender'], as_index = False)['num_medications']
            .mean()
            .sort_values(by = ['num_medications'], ascending = False))
print(diabetes_python.head(6))
##                race           gender  num_medications
## 13            Other  Unknown/Invalid        22.000000
## 7         Caucasian           Female        16.464335
## 8         Caucasian             Male        16.091046
## 1                 ?             Male        15.885764
## 0                 ?           Female        15.744925
## 3   AfricanAmerican           Female        15.634465

pandas is a very powerful but complex package. There are a lot of parameters to think about because of how flexible the package is. For example, things like the as_index = False argument above can be hard to find if you don’t know what you’re looking for. To find that particular case, you would have to look at the documentation for the groupby method, which is located here. I would recommend that you take a look to see how many ways you can modify this method with parameters. You can compare that to dplyr’s reference here.
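
To make the as_index point concrete, here is a minimal sketch on a tiny made-up table (the toy data frame and its values are assumptions for illustration, not the diabetes dataset). It shows what the groupby result looks like with and without as_index = False:

```python
import pandas as pd

# Tiny made-up table (NOT the diabetes dataset) so the sketch runs on its own
toy = pd.DataFrame({
    "race": ["A", "A", "B", "B"],
    "gender": ["Female", "Male", "Female", "Female"],
    "num_medications": [10, 20, 30, 50],
})

# Default behaviour: the grouping columns become the (multi-level) index of a Series
indexed = toy.groupby(["race", "gender"])["num_medications"].mean()

# as_index=False: the grouping columns stay as ordinary columns of a DataFrame
flat = toy.groupby(["race", "gender"], as_index=False)["num_medications"].mean()

print(type(indexed).__name__, list(indexed.index.names))
print(type(flat).__name__, list(flat.columns))
```

The flat form is usually the more convenient one to sort, filter, or plot afterwards, which is presumably why the aggregation above asks for it.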

Filtering

filtering_example = (diabetes_data[
                      (diabetes_data['gender'] == "Female") & 
                      (diabetes_data['race'] == "Caucasian") &
                      (diabetes_data['num_medications'] > 8)])
print(filtering_example.head())
##     encounter_id  patient_nbr    ...     diabetesMed readmitted
## 1         149190     55629189    ...             Yes        >30
## 8          12522     48330783    ...             Yes         NO
## 9          15738     63555939    ...             Yes         NO
## 12         40926     85504905    ...             Yes        <30
## 17         84222    108662661    ...             Yes         NO
## 
## [5 rows x 50 columns]

From personal experience, I would say that R, due to tools like dplyr and tidyr, is a friendlier language for data manipulation, and it is generally easier to figure out what you are doing. In addition, the pandas documentation is huge and can be intimidating to new users. As a comparison, there are about 10 main functions that you need to learn in dplyr, whereas pandas contains hundreds of functions that all serve different purposes.
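
One pandas feature that can narrow the readability gap for filtering is the DataFrame.query method, which takes the condition as a single string. The sketch below uses a small made-up table (an assumption for illustration, not the diabetes dataset) and checks that the query version selects the same rows as the bracket-and-mask style shown earlier:

```python
import pandas as pd

# Small made-up table (NOT the diabetes dataset) so the sketch runs on its own
toy = pd.DataFrame({
    "race": ["Caucasian", "Caucasian", "Other", "Caucasian"],
    "gender": ["Female", "Male", "Female", "Female"],
    "num_medications": [18, 12, 9, 5],
})

# Boolean-mask style, as in the filtering example above
mask_version = toy[
    (toy["gender"] == "Female")
    & (toy["race"] == "Caucasian")
    & (toy["num_medications"] > 8)
]

# Same filter written as one query string
query_version = toy.query(
    'gender == "Female" and race == "Caucasian" and num_medications > 8'
)

print(query_version)
```

Whether this reads better than chained filter() calls in dplyr is a matter of taste, but it does avoid repeating the data frame name in every condition.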

Visualization

Both R and Python are great at visualization. I honestly would call it almost a toss-up here. ggplot2 is the standard plotting library in R and it is fantastic. Python’s visualization frameworks are a bit more fragmented. The traditional standard for scientific plotting is known as matplotlib whereas some prettier graphs can be made using seaborn. For that reason, R might have the slight edge, although the difference is minimal.

R

Let’s try an example in ggplot2!

library(ggplot2)
ggplot(data = diabetes_data) + 
  geom_boxplot(aes(x = interaction(gender, race), y = num_medications, fill = gender)) +
  labs(title = "Test Plot", x = "Gender + Race", y = "Number of Medications")

Ok, the x-ticks are annoying. Fixing it requires a bit of complexity, so I’ll go ahead and illustrate that as well. Oftentimes there are hiccups like this that can cause frustration when dealing with visualization, which happens in both python and R.

library(stringr) # Work with strings! (characters) 

# Full disclosure -- I found the solution at stackoverflow:
# https://stackoverflow.com/questions/50047331/only-show-one-part-of-interacting-x-variable-in-x-axis-labels-in-ggplot 
# Stack Overflow is the programming bible -- if you know how to properly phrase your question, chances are that someone has 
# already answered it. Think of it like UpToDate but for programmers

make_labels <- function(labels) {
  result <- str_split(labels, "\\.")
  unlist(lapply(result, function(x) x[2]))
}

# This function takes in the 'labels' from the graph
# Then, it splits them at the period and takes the 2nd part, which is what we want here.
# the "\\." is what is known as a regular expression, which is a great, but seemingly complex, tool for parsing strings
# For more on regular expressions, you can read about them here: https://www.regular-expressions.info

diabetes_viz <- ggplot(data = diabetes_data) + 
  geom_boxplot(aes(x = interaction(gender, race), y = num_medications, fill = gender)) +
  labs(title = "Test Plot", y = "Number of Medications") + scale_x_discrete(labels = make_labels, name = "Race") +      
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
diabetes_viz
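
For comparison, the same label-cleaning idea can be sketched in Python with the built-in re module. The make_labels function here is a hypothetical Python re-implementation for illustration, not part of the analysis above, and it assumes every label contains at least one period:

```python
import re

def make_labels(labels):
    """Split each label at the periods and keep the 2nd piece, like the R helper."""
    # r"\." is a regular expression matching a literal period
    return [re.split(r"\.", label)[1] for label in labels]

print(make_labels(["Female.Caucasian", "Male.AfricanAmerican"]))
```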

Python

In python, the main plotting libraries are matplotlib and seaborn. Matplotlib is generally used for quick graphs to visualize some data that you aren’t going to present, since it doesn’t generally look as nice. Although matplotlib can be used to visualize data that does not come from a Pandas DataFrame, Pandas actually has some nice built-in tools to help the process.

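import matplotlib.pyplot as plt
# Use the raw rows: diabetes_python holds one mean per group, so it has no distribution to box-plot
diabetes_data.boxplot(column = ['num_medications'], by = ['race', 'gender'])
plt.show()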

As you can see, however, this doesn’t really look good. Although matplotlib allows you to customize virtually every aspect of this figure, it can often be easier to just use seaborn, which is a way to create prettier visualizations:

import seaborn as sns
# Plot the raw rows (diabetes_data); diabetes_python holds only one mean per group
sns_example = sns.boxplot(x = 'race', y = 'num_medications', hue = 'gender', data = diabetes_data)

I made the figure bigger this time so the x-ticks wouldn’t run into each other, but there are custom ways of dealing with this in python as well.
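
One such way of dealing with crowded x-ticks is to rotate the labels, much like the theme(axis.text.x = ...) call in the ggplot2 example. This is a minimal matplotlib sketch on made-up numbers (the groups dictionary and the figure size are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up numbers, purely for illustration (NOT taken from the diabetes dataset)
groups = {
    "Female.Caucasian": [12, 15, 18, 22],
    "Male.Caucasian": [10, 14, 16, 21],
    "Female.AfricanAmerican": [11, 13, 17, 19],
}

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(list(groups.values()))  # one box per group, drawn at positions 1, 2, 3

# Put a label under each box and rotate it 45 degrees so long names don't collide
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()), rotation=45, ha="right")
ax.set_ylabel("Number of Medications")
fig.tight_layout()
```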

Visualization can be tedious in any language just because of the sheer amount of customizability that is almost mandatory to make the figures exactly the way you want. Since visualization is so important, the libraries used to create them are generally very well-written and documented, so there isn’t really a wrong way to go here.

Machine Learning

Machine learning is a hot topic in healthcare and most other industries. Both R and Python offer excellent tools for machine learning. The two main packages that are used for machine learning in R and Python are caret and scikit-learn, respectively. Although they differ in their approaches, both are very usable. For most of the common machine learning techniques that were invented prior to 2010, you can probably find an implementation in these two packages. That includes things like:

  • Linear Regression
  • Logistic Regression (regularized and not)
  • Decision Trees
  • Random Forests
  • Support Vector Machines
  • Multi-Layer Perceptrons (basic neural networks)

If you’re dealing with models like this, both Python and R should be completely fine. However, if you want to start playing around with custom models, this is where Python’s ecosystem makes it the better choice. Almost no deep learning work occurs in R, because TensorFlow, which is Google’s neural network library, supports Python a lot better.

That being said, I wouldn’t make a decision on which language to choose simply due to the machine learning frameworks available in either language. If you want to do something with medical images or natural language processing, python is the better choice. Otherwise, they are essentially identical.
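
To give a feel for what the classical models listed above look like in practice, here is a minimal scikit-learn sketch of a logistic regression. It runs on synthetic data generated by make_classification (an assumption for illustration, so the numbers say nothing about the diabetes dataset), and scikit-learn's LogisticRegression applies L2 regularization by default:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 500 made-up "patients" with 10 numeric features and a binary outcome
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the rows to check the model on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)  # L2-regularized by default
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # fraction of held-out rows predicted correctly
print(f"Held-out accuracy: {accuracy:.2f}")
```

Swapping in another of the listed models, such as a random forest, mostly means changing the import and the line that creates the model; the fit and score calls stay the same.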

Statistics

R is better for statistics. Period. Most statisticians use R, and if there is a special type of hypothesis test or some other statistical analysis tool, it will most likely be implemented in R before it is implemented in python.

If you are looking to do basic statistical analysis, both languages are fine. As soon as you venture into the world of custom statistical procedures, you’re better off learning R.

Reporting on your work

One underappreciated aspect of programming languages for data science is how to present your work. Here, I think R is the winner simply because of how easy it is to use. R has something known as R Markdown, which this entire article is written in! It allows you to have in-line code like you’ve seen, and it has a bunch of bells and whistles for showing your work afterwards. For example, I can include interactive visualizations natively in these documents!

# You can hover here for some cool effects!
library(plotly) # package for interactive visualizations

interactive_df <- ggplotly(diabetes_viz)
interactive_df

Python has its own version of this type of framework, known as Jupyter Notebooks. Actually, Jupyter Notebooks can work with R as well. However, they’re not as easy to use, in my opinion. At this point in their development, both R Markdown and Jupyter Notebooks have similar functionality, so it all comes down to ease of use. You can find out more about Jupyter Notebooks here.

Web-Apps

What if you want to present a web application of your analysis? Here, I think R’s web-app framework, known as shiny, is great. It allows you to rapidly spin up applications that feature interactive visualizations and are easy to host. Python’s versions of this are much more fragmented, and may require more prerequisite knowledge and more of a programming background to set up. There are general web-app frameworks like django and flask, which are definitely not suited to beginning data science programmers. dash, which is made by plotly for Python, offers similar functionality to shiny, but it is nowhere near as robust. You can find really cool examples of modules that you can integrate seamlessly into shiny apps here. Here are some links to examples of shiny apps and dash apps, which I would say make for a fair comparison:

shiny dash

R, with its Shiny infrastructure, definitely wins in terms of ease-of-use here.

Installing packages

The two languages differ quite a bit when it comes to installing packages/libraries. In R, you can install most packages by simply opening an R console and typing install.packages('<name of your package>'). In Python, you do this from the command line, and there are several different ways to install packages. For example, two common package managers are conda and pip.

The trade-off here is pretty clear: Python’s package management is more fragmented, but it gives you greater control over which versions of packages you have installed. So for example, let’s say you’re using Python version 2.7 and you want pandas version 0.24 (the last release series that supports Python 2.7). That is something that you can easily do in Python. R, however, doesn’t really encourage this as much as Python. That may be due to the fact that R updates rarely break old code (in my experience), and most code written against prior versions of packages continues to function down the line. However, if you are putting code into production, relying on that backward compatibility instead of pinning versions is definitely not a viable solution. This whole scenario echoes the general differences between R’s and Python’s main users: R tends to attract data scientists and researchers doing one-off analyses, whereas Python attracts engineers and people looking to do custom work and put it into production.
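
As a small illustration of the version control that Python encourages, the sketch below checks from inside Python which version of a package is installed, using the built-in importlib.metadata module (available in Python 3.8 and later). The helper names installed_version and matches_pin are hypothetical and exist only for this example:

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

def matches_pin(package_name, pinned_version):
    """True only when the package is installed at exactly the pinned version."""
    return installed_version(package_name) == pinned_version

# Prints a version string such as the one pip reports, or None if pandas is absent
print(installed_version("pandas"))
```

A check like this at the top of a production script is one way to fail early when the environment does not match the versions the code was tested against.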

Integrated Development Environment (IDE)

An IDE is where most people write their data science programs and scripts. For R, the main IDE is RStudio. RStudio is one of the best IDEs in existence. It has everything a modern IDE should have and is custom-tailored for R. For python, the main IDEs for Data Science are Jupyter Notebooks and Spyder. There are many others that are probably better suited for software development, but those are the main 2 for python. Jupyter Lab, which includes Jupyter Notebooks as a subset, is also getting better in terms of its functionality, but I would say that R wins hands-down in this department.

RStudio can be downloaded here, whereas instructions for installing Jupyter Notebooks can be found here.

Conclusion

Hopefully this has exposed you a little bit to the differences between R and Python and the ways in which they are similar. In my opinion, it is difficult to make a wrong choice here. A lot of the process of learning transfers directly between the two languages. If you are just getting started, I would suggest just making a decision and sticking with it. If you have any questions or feedback (things you’d like to see added, etc.), feel free to email me at michael.gao@duke.edu